Introducing the introduction

Structural Equation Modeling

Tommaso Feraco

Outline

  • Course contents
  • Approaching the course

Topics that we will (hopefully) cover in the course

  1. lavaan
  2. Basic knowledge about SEM
  3. Models with manifest variables - path analysis
  4. Measurement models with SEM - CFA
  5. Full SEM: measurement + structural part
  6. Model invariance - MG-CFA
  7. Models for ordinal data
  8. Power analysis for SEM
  9. Other miscellaneous topics (briefly)

Why am I here?

I am convinced that SEM is a fundamental tool for research in psychology and that most, if not all, researchers in this area should know it. Indeed, it is key for many aspects of your research:

  • Measurement
  • Multivariate analyses
  • Complex regression models
  • Longitudinal analyses

What you can expect from this course

I am not a statistician. This will cost you some statistical depth by the end of the course, but hopefully you will gain more practical, psychology-based examples and experience.

  • Practical examples
  • Data simulation
  • Few equations
  • Open discussions
  • Hands-on work with your data

What I need from you

  • A PC (optional)
  • Basic R knowledge
  • Knowledge of for loops
  • Some packages installed
install.packages(c("lavaan", "semTools", 
                   "semPlot", "MASS"))

Materials and organization

  • The material is divided into topics

  • For each topic you will find:

    • Slides

    • Additional code

    • Data

  • We will probably do live coding when needed. I will work on this file: LiveCode

  • I also prepared this file where we can collect questions that are too early to answer or that you want to save for later: Q doc

  • In general, live materials are in this folder

Moodle

Slides and materials are on the Moodle page of the course

OR moodle psicologia unipd > Formazione Post Lauream > Corsi di Dottorato > Psychological Sciences aa 2023/2024 > Structural Equations

Registro didattico

Logbook: please fill in the logbook every day.

Introduction

Outline

  • Basics
  • Variables and relationships
  • Steps
  • Basic concepts
  • Graphics
  • SEM world
  • Appendix

How do you fit these?

SEM = Structural Equation Modeling

  • SEM is a multivariate statistical modeling technique

    • It includes path analysis, causal models, factor models, measurement models, and latent growth models; even simple multiple regression or ANOVA can be considered special cases of SEM.
    • All these techniques use the covariance matrix (\(\boldsymbol{S}\)) for estimating target model parameters.
  • SEM allows us to test a hypothesis/model about the data

    • we postulate a data-generating model
    • we evaluate whether this model fits the data or not
  • What is so special about SEM?

    • we can model latent variables (e.g., ‘invisible constructs’)
    • we can test indirect and reciprocal effects and more
    • last but not least, we can make diagrams (or PAINTINGS if theory is weak!)

Variance-covariance matrices

library(lavaan)  # provides the PoliticalDemocracy data set
options(digits = 2)
cov(PoliticalDemocracy[1:7])
    y1   y2   y3   y4  y5   y6   y7
y1 6.9  6.3  5.8  6.1 5.1  5.7  5.8
y2 6.3 15.6  5.8  9.5 5.6  9.4  7.5
y3 5.8  5.8 10.8  6.7 4.9  4.7  7.0
y4 6.1  9.5  6.7 11.2 5.7  7.4  7.5
y5 5.1  5.6  4.9  5.7 6.8  5.0  5.8
y6 5.7  9.4  4.7  7.4 5.0 11.4  6.7
y7 5.8  7.5  7.0  7.5 5.8  6.7 10.8

SEM works with matrices

  • \(\boldsymbol{S}\) observed var-cov

  • \(\boldsymbol{\Sigma}\) true var-cov

  • \(\boldsymbol{\hat{\Sigma}}\) model-implied var-cov

  • \(\boldsymbol{\Sigma}(\theta)\) var-cov written as a function of the model parameters \(\theta\)

THE MAIN AIM OF SEM IS TO RECONSTRUCT THE TRUE VARIANCE-COVARIANCE MATRIX
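A minimal sketch of this idea, assuming a one-factor model for the first four indicators of PoliticalDemocracy (the factor name dem and the choice of indicators are illustrative assumptions, not a model taken from these slides). It extracts the observed matrix \(\boldsymbol{S}\) and the model-implied matrix \(\boldsymbol{\hat{\Sigma}}\) from a fitted lavaan object and looks at their difference:

```r
library(lavaan)

# Hypothetical one-factor model, used only to have something to fit
model <- 'dem =~ y1 + y2 + y3 + y4'
fit <- cfa(model, data = PoliticalDemocracy)

# Observed covariance matrix as used by lavaan
S <- lavInspect(fit, "sampstat")$cov
# Model-implied covariance matrix at the estimated parameters
Sigma_hat <- fitted(fit)$cov

# Residuals: the part of S that the model fails to reproduce
round(S - Sigma_hat, 2)
```

Under ML, lavaan by default divides by N rather than N - 1, so this S can differ slightly from the output of cov().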

Classification of variables

Variables are the way in which attributes that vary across individuals are operationalized and represented for further data processing. They can be categorized according to many criteria (e.g., dependent/independent…), but in SEM we first classify them as:

  • Latent variables

    • hypothetical variables that correspond to more or less abstract concepts

    • formative or reflective

    • examples are intelligence, anxiety, executive functions, personality traits…

  • Observed variables

    • variables that can be directly observed and measured

    • examples can be weight, height, gender, income…

Classification of variables

In SEM we also have an additional type of classification:

  • Exogenous variables

    • Variables whose causes lie outside the model; they will be used only as predictors in the model. They do not receive arrows.

    • They are indicated with \(x\), if observed, or with \(\xi\), if latent.

  • Endogenous variables

    • Variables that are determined by variables within the model (they receive arrows); can be used as predictors or dependent variables in the model.

    • They are indicated with \(y\), if observed, or with \(\eta\), if latent.

This leads us to look more closely at the relationships between variables.

Relationships between variables

  • The general aim of statistical analysis is to study relationships among variables

  • On the basis of the relationships among the variables, we distinguish two kinds of models: symmetrical and asymmetrical.

Asymmetrical relationships

X -> Y

  • Variables are divided into two sets: dependent or response variables and predictors or explanatory variables

  • \(X\) is the set of explanatory variables, \(Y\) is the set of response variables, and arrows represent the direction of the hypothesized relationship.

  • These models imply cause-and-effect relationships.

Example

People who study more obtain higher grades.

Symmetrical relationships

\[ X_i \Leftrightarrow Y_j \quad \forall i,j \]

  • This means that neither variable causes the other, nor can either be considered prior in time to the other; all these relationships are bidirectional.

  • These models neither imply nor consider causality.

Example

People who have higher grades in math have higher grades in art.

Regression model

Asymmetrical relationships are usually tested with regressions!

As you remember, regression models can be written, using classical formulation, as the expression below and graphically depicted (getting closer to SEM) like this:
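As a hedged illustration, a simple regression with one predictor is commonly written \(y_i = \beta_0 + \beta_1 x_i + e_i\) (the symbols \(\beta_0\), \(\beta_1\), and \(e_i\) are assumptions made here). The sketch below fits the same regression with lm() and with lavaan, using y1 and x1 from PoliticalDemocracy purely as example variables:

```r
library(lavaan)

# Classical regression (variables chosen only for illustration)
fit_lm <- lm(y1 ~ x1, data = PoliticalDemocracy)

# The same regression written in lavaan model syntax
fit_sem <- sem('y1 ~ x1', data = PoliticalDemocracy)

# The two slopes should agree up to numerical tolerance
coef(fit_lm)["x1"]
coef(fit_sem)["y1~x1"]
```

The residual variance can differ slightly between the two fits, because ML divides by N while lm() uses the residual degrees of freedom.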

More regression?

But what if we have in mind a more complex pattern of relationships? What if we have several regression models in mind and need to estimate all of them simultaneously?

What we need is a system of equations.
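As an illustration only, with hypothetical observed variables \(x_1\), \(y_1\), \(y_2\) and coefficient symbols following the \(\gamma\)/\(\beta\)/\(\zeta\) convention used later in these slides, such a system might look like:

\[
\begin{aligned}
y_1 &= \gamma_{11} x_1 + \zeta_1 \\
y_2 &= \gamma_{21} x_1 + \beta_{21} y_1 + \zeta_2
\end{aligned}
\]

Here \(y_1\) is an outcome in the first equation and a predictor in the second, which a single regression cannot express.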

More regression?

This system can also be drawn with SEM notation, but is actually the same…just better!
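A minimal sketch of such a system in lavaan syntax, assuming simulated data with hypothetical variable names (x, m, y) and made-up coefficients. Each regression is one line of the model string, and all of them are estimated together:

```r
library(lavaan)

set.seed(123)
n <- 500
# Simulate data from a known system of two regressions (made-up values)
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * m + 0.2 * x + rnorm(n)
dat <- data.frame(x, m, y)

# One line per equation; lavaan estimates them simultaneously
model <- '
  m ~ x
  y ~ m + x
'
fit <- sem(model, data = dat)
parameterEstimates(fit)[, c("lhs", "op", "rhs", "est", "se")]
```

Because the data are simulated, the estimates can be compared with the values used to generate them, which is the logic behind the "simulate, simulate, simulate" advice in this course.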

Covariance matrix

The covariance matrix is the input for the estimation process. In general, given \(q\) exogenous (\(x\)) and \(p\) endogenous (\(y\)) variables, the covariance matrix will be:

Here the diagonal elements are variances and the off-diagonal elements are covariances.
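One common way to write this matrix, assuming the \(p\) endogenous variables are ordered before the \(q\) exogenous ones (the block labels are notation introduced here for illustration):

\[
\boldsymbol{S} =
\begin{bmatrix}
\boldsymbol{S}_{yy} & \boldsymbol{S}_{yx} \\
\boldsymbol{S}_{xy} & \boldsymbol{S}_{xx}
\end{bmatrix},
\qquad
\boldsymbol{S}_{yy}: p \times p, \quad
\boldsymbol{S}_{yx}: p \times q, \quad
\boldsymbol{S}_{xx}: q \times q
\]

Since \(\boldsymbol{S}_{xy} = \boldsymbol{S}_{yx}^{\top}\), the full matrix contains \((p + q)(p + q + 1)/2\) nonredundant elements.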

Variables and errors

  • Variables

    • \(x\) exogenous observed (\(q\))

    • \(\xi\) exogenous latent (\(n\))

    • \(y\) endogenous observed (\(p\))

    • \(\eta\) endogenous latent (\(m\))

  • Stochastic errors

    • \(\delta\) measurement errors in \(x\)

    • \(\epsilon\) measurement errors in \(y\)

    • \(\zeta\) equation errors in the structural relationship between \(\eta\) and \(\xi\)

SEM matrices - lavaan model

  • Parameter matrices

    • \(\boldsymbol{\Lambda}\): relationships between latent (\(\xi\) and \(\eta\)) and observed (\(x\) and \(y\)) variables [\((p + q) \times (m + n)\)]

    • \(\boldsymbol{B}\): relationships between latent variables [\((m + n) \times (m + n)\)]

  • Covariance matrices

    • \(Cov(\zeta, \xi) = \boldsymbol{\Psi}\) matrix [\((m + n) \times (m + n)\)]

    • \(Cov(\epsilon, \delta) = \boldsymbol{\Theta}\) matrix [\((p + q) \times (p + q)\)]
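These matrices can be inspected on a fitted lavaan object. A minimal sketch, assuming a hypothetical two-factor model with one structural path fitted to PoliticalDemocracy (the factor names dem60 and dem65 and the model itself are illustrative choices):

```r
library(lavaan)

# Hypothetical two-factor model with one structural path (illustration only)
model <- '
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8
  dem65 ~ dem60
'
fit <- sem(model, data = PoliticalDemocracy)

# Estimated parameter matrices in lavaan's representation
est <- lavInspect(fit, "est")
names(est)         # expected to include lambda, theta, psi, beta
dim(est$lambda)    # observed variables x latent variables
dim(est$beta)      # latent variables x latent variables
```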

SEM equations

The SEM model in its most general form consists of two parts

  • The measurement model

    • \(x = \boldsymbol{\Lambda}_x\boldsymbol{\xi} + \boldsymbol{\delta}\)

    • \(y = \boldsymbol{\Lambda}_y\boldsymbol{\eta} + \boldsymbol{\epsilon}\)

  • The structural model

    • \(\boldsymbol{\eta} = \boldsymbol{B\eta} + \boldsymbol{\Gamma\xi} + \boldsymbol{\zeta}\)

    • \(\boldsymbol{\eta} = \boldsymbol{B\eta} + \boldsymbol{\zeta}\) (lavaan form, in which \(\boldsymbol{\eta}\) stacks both \(\eta\) and \(\xi\))
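If all latent variables are stacked into one vector \(\boldsymbol{\eta}\) (so that \(\xi\) is absorbed and the structural part reads \(\boldsymbol{\eta} = \boldsymbol{B\eta} + \boldsymbol{\zeta}\)) and all observed variables into one vector \(\boldsymbol{y}\), the two parts combine into the model-implied covariance matrix. A sketch of the derivation, assuming \(\boldsymbol{I} - \boldsymbol{B}\) is invertible, \(\boldsymbol{\Psi} = Cov(\boldsymbol{\zeta})\), and \(\boldsymbol{\Theta} = Cov(\boldsymbol{\epsilon})\):

\[
\begin{aligned}
\boldsymbol{\eta} &= (\boldsymbol{I} - \boldsymbol{B})^{-1}\boldsymbol{\zeta} \\
\boldsymbol{y} &= \boldsymbol{\Lambda}\boldsymbol{\eta} + \boldsymbol{\epsilon} \\
\boldsymbol{\Sigma}(\theta) &= \boldsymbol{\Lambda}(\boldsymbol{I} - \boldsymbol{B})^{-1}\,\boldsymbol{\Psi}\,\left[(\boldsymbol{I} - \boldsymbol{B})^{-1}\right]^{\top}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta}
\end{aligned}
\]

The last line uses the assumption that the measurement errors are uncorrelated with the latent variables.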

SEM assumptions

  • Expected values of latent variables and stochastic errors are 0:

    • \(E\)(\(\eta\)) = 0

    • \(E\)(\(\xi\)) = 0

    • \(E\)(\(\zeta\)) = 0

    • \(E\)(\(\epsilon\)) = 0

    • \(E\)(\(\delta\)) = 0

  • Errors are uncorrelated with latent variables and are mutually uncorrelated:
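In the notation above, these conditions are commonly written as follows (a sketch of the usual LISREL-style formulation, where each line refers to covariances between different types of terms):

\[
\begin{aligned}
Cov(\xi, \delta) &= 0, & Cov(\eta, \epsilon) &= 0, & Cov(\xi, \zeta) &= 0, \\
Cov(\delta, \epsilon) &= 0, & Cov(\delta, \zeta) &= 0, & Cov(\epsilon, \zeta) &= 0
\end{aligned}
\]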

SEM steps

There are 5 principal steps in Structural Equation Modeling:

  1. model specification

  2. model identification

  3. parameter estimation

  4. testing

  5. model modification

As usual, these steps form a cycle: when you arrive at step 5 you can always go back to step 1.

1. Model specification

Aim of the model

  • We fit models because we want to better understand the data and the process of data generation (to better understand this we will use simulation…simulate, simulate, simulate!)

What is a model

  • A model is a formal representation of a theory and is composed of a set of parameters that we will estimate.

Examples
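A minimal sketch of how a model can be specified in lavaan syntax, assuming a hypothetical model with two latent variables (f1, f2) and six indicators (y1 to y6). No data are needed at this stage: lavaanify() only turns the syntax into a parameter table.

```r
library(lavaan)

# Hypothetical model: two latent variables and one structural path
model <- '
  # measurement part: latent =~ indicators
  f1 =~ y1 + y2 + y3
  f2 =~ y4 + y5 + y6
  # structural part: outcome ~ predictor
  f2 ~ f1
  # residual covariance between two indicators
  y1 ~~ y4
'

# Turn the syntax into a parameter table, without any data
pt <- lavaanify(model, auto = TRUE, model.type = "sem")
pt[, c("lhs", "op", "rhs", "free")]
```

Rows with free equal to 0 are fixed parameters (for example the first loading of each factor); the others will be estimated.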

2. Model identification

Basically, we want to know if there is enough information to identify a solution (aka estimate all the unknown parameters).

A model can be:

  • Under-identified: there are MORE parameters to be estimated than nonredundant elements in the covariance matrix (\(df < 0\); e.g., solving \(X + Y = 10\) for two unknowns)

  • Just-identified: the number of parameters to be estimated equals the number of nonredundant elements in the covariance matrix (\(df = 0\))

  • Over-identified: there are FEWER parameters to be estimated than nonredundant elements in the covariance matrix (\(df > 0\))

2. Model identification

To ensure that the number of unknown parameters (\(t\)) is not greater than the number of nonredundant elements in the covariance matrix of \(q\) observed variables, we can use the following formula:

\[ t \leq \frac{q(q+1)}{2} \]
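The counting rule can be checked with a few lines of base R. This is a sketch with hypothetical helper names (n_moments, sem_df), applied to a one-factor model with four indicators as an assumed example:

```r
# Number of nonredundant elements in a q x q covariance matrix
n_moments <- function(q) q * (q + 1) / 2

# Degrees of freedom: available information minus free parameters
sem_df <- function(q, t_free) n_moments(q) - t_free

# Hypothetical example: one factor measured by 4 indicators, with
# 3 free loadings (one fixed to 1), 4 residual variances, 1 factor variance
q <- 4
t_free <- 3 + 4 + 1
n_moments(q)        # 10
sem_df(q, t_free)   # 2, i.e. over-identified
```

Note that this rule is necessary but not sufficient: a model can satisfy it and still have parameters that are not identified.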

An under-identified model

An over-identified model

3. Parameter estimation

To estimate the model parameters we can use different estimation methods. These search for the parameter values whose model-implied (theoretical) covariance matrix \(\boldsymbol{\hat{\Sigma}}\), which is a function of the model parameters, is as similar as possible to the observed covariance matrix \(\boldsymbol{S}\).

Some of the many estimation methods are:

  • Maximum Likelihood (ML), default in lavaan

  • Unweighted Least Squares (ULS)

  • Generalized Least Squares (GLS)

  • Diagonally Weighted Least Squares (DWLS), default for ordinal variables in lavaan
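Each method defines "close" through a discrepancy function. For ML, with \(q\) observed variables, the usual form is \(F_{ML} = \log|\boldsymbol{\Sigma}(\theta)| + tr(\boldsymbol{S}\boldsymbol{\Sigma}(\theta)^{-1}) - \log|\boldsymbol{S}| - q\). A minimal base-R sketch of this function, with a hypothetical function name (f_ml) and made-up 2 x 2 matrices:

```r
# ML discrepancy between an observed matrix S and a candidate matrix Sigma
f_ml <- function(S, Sigma) {
  q <- nrow(S)
  log(det(Sigma)) + sum(diag(S %*% solve(Sigma))) - log(det(S)) - q
}

# Toy 2 x 2 example with made-up numbers
S <- matrix(c(1.0, 0.5,
              0.5, 1.0), nrow = 2)
Sigma_good <- S          # reproduces S exactly
Sigma_bad  <- diag(2)    # wrongly assumes zero covariance

f_ml(S, Sigma_good)   # 0 (up to rounding): perfect reproduction
f_ml(S, Sigma_bad)    # positive: a worse match gives a larger value
```

Estimation amounts to searching for the \(\theta\) that makes this value as small as possible.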

4. Model evaluation

Is the model adequate? Are our parameters able to produce a model-implied matrix (\(\boldsymbol{\hat{\Sigma}}\)) that is close to the original empirical covariance matrix \(\boldsymbol{S}\)?

This is the goal of a good model: to reproduce the original covariance matrix from a set of theoretical associations/effects (aka the model-implied covariance matrix).

Formally:

\[ H_0 : \boldsymbol{\hat{\Sigma}}(\theta) = \boldsymbol{\Sigma} \] where \(\boldsymbol{\Sigma}\) is the true covariance matrix among model variables, \(\theta\) the parameters vector, and \(\boldsymbol{\hat{\Sigma}}\) the reproduced covariance matrix.
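In lavaan, the chi-square test of this hypothesis and several descriptive fit indices can be requested with fitMeasures(). A minimal sketch, assuming a hypothetical one-factor model for four PoliticalDemocracy indicators (the model is an illustration, not one from these slides):

```r
library(lavaan)

# Hypothetical one-factor model, for illustration only
fit <- cfa('dem =~ y1 + y2 + y3 + y4', data = PoliticalDemocracy)

# Chi-square test of H0 plus a few common fit indices
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "rmsea", "srmr"))
```

A small p-value suggests that the model-implied matrix differs from the covariance matrix in the population, that is, that the model does not fit exactly.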

5. Model modification

At this point you are free to modify the model based on the results obtained…AND THE THEORY!
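One data-driven aid for this step is the modification index, which approximates how much the chi-square would drop if a fixed parameter were freed. A minimal sketch with lavaan's modindices(), again assuming a hypothetical one-factor model:

```r
library(lavaan)

# Hypothetical one-factor model, for illustration only
fit <- cfa('dem =~ y1 + y2 + y3 + y4', data = PoliticalDemocracy)

# Modification indices, sorted from largest to smallest
mi <- modindices(fit, sort. = TRUE)
head(mi[, c("lhs", "op", "rhs", "mi", "epc")])
```

A large index only says that a parameter would improve fit; whether freeing it makes sense is a question for the theory.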

A full representation

img credits to dr. Johnny Lin

Graphical representation

If all that seemed difficult and boring, now comes the fun part: colors, figures, and arrows!

Graphical representation is a key attribute of structural equation modeling:

  • It helps in understanding the model

  • It helps in thinking and reasoning about the model (a priori)

  • It helps in writing and formalizing the model

  • It is easy, but a few rules must be followed to have a readable model
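Diagrams can also be drawn from a fitted model. A minimal sketch using semPaths() from the semPlot package, assuming a hypothetical one-factor model; the argument values shown are one reasonable choice among many:

```r
library(lavaan)
library(semPlot)

# Hypothetical one-factor model, for illustration only
fit <- cfa('dem =~ y1 + y2 + y3 + y4', data = PoliticalDemocracy)

# Draw the path diagram with parameter estimates as edge labels
diagram <- semPaths(fit, what = "paths", whatLabels = "est", layout = "tree")
```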

Graphical representation

  • Latent variables are circles or ellipses

  • Manifest/observed variables are square or rectangular boxes

  • Errors are represented by corresponding letters (or values) only

\[ \delta_1 / \epsilon_1 / \zeta_1 \]

Graphic relationships

  • All model relationships are represented by arrows;

     NO relationship NO arrow... 
    
     ...and usually NO arrow NO relationship
  • Each arrow is a model parameter and has two indices (e.g., \(\beta_{21}\))

  • Asymmetrical relationships are represented by single-headed arrows: the first index indicates the variable the arrow is pointing to, the second index indicates the variable of origin.

  • Symmetrical relationships are represented by double-headed arrows and two indices, one for each variable.
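As a worked example of the index convention, with two hypothetical endogenous latent variables \(\eta_1\) and \(\eta_2\), the parameter \(\beta_{21}\) labels the single-headed arrow that starts at \(\eta_1\) and points to \(\eta_2\):

\[ \eta_2 = \beta_{21}\,\eta_1 + \zeta_2 \]

Reading the indices as "to, from" avoids confusing \(\beta_{21}\) with \(\beta_{12}\), which would be the arrow in the opposite direction.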

Graphic relationships

A summary

Asymmetrical relationships

Symmetrical relationships

Graphical errors

  • All errors have a single-headed arrow pointing to a variable; all variables, except \(\xi\), may have an error.

  • Double-headed arrows associated with errors indicate error variances.

Univariate regressions

Multivariate regressions

Path analysis

Confirmatory factor analysis (CFA)

SEM path analysis

t test with latent variables

Cross-lagged panel models

Growth curve models

And much more

THERE IS EVEN A JOURNAL ON SEM

Structural Equation Modeling: A Multidisciplinary Journal